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Abstract 

When can reliable inference be drawn in the “Big Data” context? This paper presents a 
framework for answering this fundamental question in the context of correlation mining, 
with implications for general large scale inference. In large scale data applications like 
genomics, connectomics, and eco-informatics the dataset is often variable-rich but sample- 
starved: a regime where the number n of acquired samples (statistical replicates) is far fewer 
than the number p of observed variables (genes, neurons, voxels, or chemical constituents). 
Much of recent work has focused on understanding the computational complexity of proposed 
methods for “Big Data”. Sample complexity however has received relatively less attention, 
especially in the setting when the sample size n is fixed, and the dimension p grows without 
bound. To address this gap, we develop a unified statistical framework that explicitly quanti¬ 
fies the sample complexity of various inferential tasks. Sampling regimes can be divided into 
several categories: 1) the classical asymptotic regime where the variable dimension is fixed 
and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable 
dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional 
asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. 
Each regime has its niche but only the latter regime applies to exa-scale data dimension. 
We illustrate this high dimensional framework for the problem of correlation mining, where 
it is the matrix of pairwise and partial correlations among the variables that are of interest. 
Correlation mining arises in numerous applications and subsumes the regression context as 
a special case. We demonstrate various regimes of correlation mining based on the unifying 
perspective of high dimensional learning rates and sample complexity for different structured 
covariance models and different inference tasks. 

Keywords: large scale inference, Big Data, sample complexity, asymptotic regimes, 
purely high dimensional, unifying learning theory, triple asymptotic framework, correlation 
mining, correlation estimation, correlation selection, correlation screening, graphical models. 
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1 Introduction 


The increasing availability of large scale and high dimensional data is driving a major resur¬ 
gence of data science, recently rebranded under the moniker “Big Data” [90]. There has 
been a preponderance of catch phrases such as “big-data-biology,” “ecoinformatics,” “pre¬ 
cision medicine,” “data-driven decision-making,” “Big Data business analytics” in scientific 
publications and the media. However, until recently, most of the research in Big Data has 
concentrated on issues of data management, data warehousing, computational data analysis, 
and end-user data utilization [ 129] , [93], [96]. While the data management research commu¬ 
nity has made progress on the problem of quality assurance, e.g., associated with provenance 
and computer errors |80| . the issue of limited sample size and statistical reproducibility re¬ 
mains largely open. This issue has been recognized as one of the principal hurdles that stand 
in the way of success of the scientific enterprise [154] . [ 1T6] . The negative consequences of 
insufficient samples can be especially dire when the data is high dimensional, heterogenous 
and uncalibrated [26], [HI]. It is therefore both important and timely to address the problem 
of inference on Big Data from the point of view of statistical reproducibility. 

The statistical reproducibility point of view is founded on a non-monolithic notion of Big 
Data: the data should be considered as a matrix with p columns and n rows indexed by, 
respectively, p variables over each of the data fields and n independent samples of these vari¬ 
ables. If there is only a single sample then the matrix collapses to a vector and no statistical 
analysis of reproducibility can be performed. With larger sample size, the probability of 
reproducibility can be studied in the context of the statistical theory of random matrices. 

Here we develop this perspective for a particular Big Data problem: correlation mining 
in high dimension where the number of samples is much smaller than the dimension, a 
setting that we call “sampled starved.” Correlation mining is an area of data mining where 
the objective is to discover patterns of correlation between a large number (p) of observed 
variables based on a limited number (n) of samples. Correlation mining can be framed as the 
mathematical problem of reliably reconstructing different attributes of the correlation matrix 
of the population from the sample covariance matrix that is empirically constructed from 
the p x n data matrix. The inference task depends on the attributes of the correlation that 
are of interest, while the performance of a correlation mining algorithm for a particular task 
depends on the number of samples and the underlying structure of the population covariance. 
In high dimensional settings where p^> n, correlation mining presents significant challenges 
to the practitioner, both in terms of unavailability of computationally tractable algorithms 
and in terms of lack theory that could be used to specify sample size requirements. This 
paper will provide some perspective on the latter challenges. 

Covariance matrices arise in a very diverse set of Big Data applications including: em¬ 
pirical finance and econometrics [84], [85], [86], m, m ,» MIMO radar and commu¬ 
nications [54], [51], [77], [13], [48], [89], [62] , [12]; image analysis [49] .[92]. [18] . [152] : network 
sensing [137] . [T7] . [102] . [103] . [16] . [21] : life and biomedical sciences [HI], [136] . [2], [118] . 
HH], H55], [106] ; and climate science m , m, ra, m \, to name just a few. However, 
covariance matrices are used differently depending on the application and the task. For 
tracking of targets from space-time-adaptive-radar (STAP) the task is to estimate the entire 
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covariance matrix in so far as it yields estimates of eigenvectors that span the signal (target) 
subspace [53]. For exploring functional gene regulation networks the task is to determine the 
matrix locations of the largest pairwise correlations or inverse correlations, often obtained 
by thresholding the sample correlation matrix [61] or by performing sparsity constrained op¬ 
timization 0 . m ■ In linear discriminant analysis the task is to estimate a quadratic form 
of the inverse covariance matrix [73], [37]. In anomaly detection, it is the Schur complement 
of the covariance matrix that is of interest as it is related to residual prediction error covari¬ 
ance [150] . In independence screening one is interested in the support of the non-zero rows 
of the covariance [36] . In variable selection for prediction it is the support of the regression 
coefficient vector that is of interest [37] (a setting in which correlation mining subsumes the 
regression context as a special case). 

Of central importance for all of these applications are the sample size requirements, which 
can differ from task to task. Reliably performing some of these correlation mining tasks might 
require relatively few samples, e.g., screening for the presence of variables that are hubs of 
high correlation in a sparsely correlated population, while other tasks might require many 
more samples, e.g., accurately estimating all entries of the inverse covariance matrix in a 
densely correlated population. A theoretical framework has been emerging for predicting 
these sampling requirements as a function of the population correlation structure and as a 
function of the inference task. The principal aim of this paper is to present building blocks 
of this framework with an emphasis on the high dimensional setting where the number p 
of variables is much larger than the number n of samples (p 3> n), a setting relevant to 
correlation mining in massive data sets. 

Several real-world examples of this high dimensional setting are given below: 

• In studies of correlation networks of the annotated human genome p is on the order 
of tens of thousands of genes while n is typically fewer than a hundred samples, e.g., 
corresponding to a population of human subjects. In these studies correlation levels 
of magnitude as low as 0.2 are sometimes considered significant pi, though some of 
them may actually be spurious. 

• In space-time-adaptive-processing (STAP) radar a spatio-temporal covariance matrix 
is used to filter out clutter in order to better detect a moving target at a particular 
range and Doppler frequency. For full degrees-of-freedom (DOF) STAP the estimator 
of the spatio-temporal clutter covariance can have dimension p = rq on the order of 
hundreds of thousands and the number of samples n under a hundred. Here r is the 
number of radar pulses (time), q is the number of elements in the radar array (space), 
and n is the number of range bins in the vicinity of the target [53].[55]. 

• For spambot network discovery from honeypot data, correlation of spamming profiles 
is performed between p = hundreds of thousands of known IP addresses while the 
number of time points in each profile may only be on the order of n = 30 [153]. 

• In recommender systems preference vectors o( p — millions of subscribers are correlated 
with other subscribers on the basis of n = a few hundred preference categories, e.g., 
movies or music, to predict future user preferences SHI, 0. 
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• In fMRI connectomics a long term goal is to correlate brain activations across individual 
brain neurons, i.e., p = 10 11 . Currently connectome researchers might parcel the 
brain into p regions, with p = several thousand, and estimate correlation by averaging 
over n = hundreds of repeated activation stimuli. The objective is to use the sample 
correlation matrix to study patterns of activation in order to reveal the “brain network” 
|125j, |133j. 

Therefore, it is not an exaggeration to say that correlation mining practitioners face a 
deluge of variables with very limited sample size. With so few samples one is bound to find 
spurious correlations between some pairs of the many variables. It is therefore essential to 
understand the intrinsic sampling requirements of such statistical inference problems. The 
study of sampling requirements falls into several different asymptotic regimes. Classical 
statistical error prediction and control methods are based on an asymptotic regime where 
the dimension p is fixed and the sample size n goes to infinity. Such a regime is obviously 
inapplicable to the case of n < p. Recently, statisticians have developed theory that applies 
to the regime where p goes to infinity. This theory is in the realm of high dimensional 
statistical inference, often called the “large p small n regime” HH, S3- This theory covers 
the case where both n and p go to infinity, which, as contrasted to the classical fixed p 
regime, is a setting that we called the mixed asymptotic regime and includes the so-called 
“high dimensional,” “very-high dimensional” or “ultra-high dimensional” settings na. ra. 
Each of these denote regimes where the speed at which n goes to infinity as a function of p 
becomes progressively slower. However, since the theory requires that both p and n go to 
infinity it may not be very useful in applications where the availability of samples is limited 
and finite. In such a case a more useful and relevant regime is the “purely-high dimensional” 
setting where the number of samples n remains fixed while the number of variables p goes 
to infinity. This setting applies, for example, to the correlation screening application [61], 
discussed in Sec. [5] where the correlation threshold p approaches one as p becomes large. 
This regime is in fact the highest possible dimensional regime, short of having no samples at 
all. Thus it is appropriate to call this purely-high dimensional regime the “ultimately-high 
dimensional regime,” and we shall use these two terms interchangeably in the sequel. A 
table comparing these asymptotic regimes is given below (Table [I]). 

The purely-high dimensional regime of large p and fixed n is a mathematical character¬ 
ization of the extremely big data problem. Evidently, this regime poses several challenges 
to correlation mining practitioners. These include both computational challenges and the 
challenge of error control and performance prediction. Yet this regime also holds some rather 
pleasant surprises. Remarkably, there are surprising benefits to having few samples in terms 
of computation and scaling laws. There is a scalable computational complexity advantage 
relative to other high dimensional regimes where both n and p are large. In particular, 
correlation mining algorithms can take advantage of numerical linear algebra shortcuts and 
approximate k nearest neighbor search to compute large sparse correlation or partial corre¬ 
lation networks. Another benefit of purely-high dimensionality is an advantageous scaling 
law of the false positive rates as a function of n and p. Even small increases in sample size 
can provide significant gains in this regime. For example, when the dimension is p — 10, 000 
and the number of samples is n — 100, the experimenter only needs to double the number 
of samples in order to accommodate an increase in dimension by six orders of magnitude 
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Asymptotic 

framework 

Terminology 
( “setting” ) 

Sample 

size 

Model 

dimension 

Application 

domain 

References 

(selected) 



n 

P 



Large 

sample 

asymptotics 

(Classical) 

small 

dimensional 

- > oo 

fixed 

“small data” 

Fisher |42||43|, Rao 108| 1091, 

Neyman and Pearson I99J, Wilks 11511 , 
Wald [14 T\ 142 143 144], 
Cramer|24l [531, Le CamT^I 1831, 
Chernoff |20|, Kiefer and Wolfowitz|76|, 
Bahadur [S], Efron 1351 

Mixed 

asymptotics 

high 

to 

ultra high 
dimensional 

- > oo 

-> oo 

“medium sized” data 
(mega or giga scales) 

Donoho |33|, Zhao and Yu 11551, 

Meinshausen and Biihlmann 1951. 

Candes and Tao fc 161, Wainwright 11381 [T39I, 
Bickel, Ritov, and Tsybakov|10|, 

Peng, Wang, Zhou, and Zhu 11041, 

Khare, Oh, and Rajaratnam, |72|, 

Fan and Lv 1361 

Large 

dimension 

asymptotics 

purely high 
dimensional 

fixed 

- > oo 

Sample Starved 
“Big Data” 

(tera to exascales) 

Hero and Rajaratnam 1611. 

Hero and Rajaratnam [58], 

Firouzi, Hero and Rajaratnam 1381 


Tabic 1: Overview of different asymptotic regimes of statistical data analysis. These regimes are 
determined by the relation between the number n of samples drawn from the population and the 
number p, called the dimension, of variables. In the classical asymptotic regime the number p is 
fixed while n goes to infinity. This is the regime where most of the well known classical statistical 
testing procedures, such as student t tests of the mean, Fisher F tests of the variance, and Pearson 
tests of the correlation, can be applied reliably. Mixed asymptotic regimes where n and p both go to 
infinity have received much attention in modern statistics. However, in this era of big data where 
p is exceedingly large, the mixed asymptotic regime is inadequate since it still requires that n go 
to infinity. The recently introduced “purely high dimensional regime ” m, m addressed in this 
paper, is more suitable to big data problems where n is limited and finite (adapted from I63f) 


(p = 10,000,000,000) without increasing the false positive rate [58] . 

The unifying comparative tool used in this paper is sample complexity analysis. We will 
use this analysis to place different statistical models and different inference tasks into the 
various asymptotic regimes shown Table [I] (see Tables [2]{3]). Sample complexity analysis 
has been widely applied to study the performance of inference procedures. In supervised 
computational learning sample complexity is defined as the number of training samples 
necessary to maintain a given level of generalization error as a function of the complexity of 
the statistical model or algorithm m ra, m, m ■ Similar notions arise in the context 
of high dimensional convergence rates [15], Bayesian model complexity [ 124 ]. and stochastic 
complexity (minimum description length - MDL) mug There is always a tradeoff between 
statistical sample complexity and statistical model complexity. The nature of this tradeoff 
is specific to the model and to the task. This specificity forms the basis for the unifying the 
theory of large scale inference that is developed in this paper. 

The outline of the paper is as follows. Sec. [2] reviews the definitions of covariance, 
precision, correlation, partial correlation, and Gaussian graphical models. Section [3] treats 
the problem of covariance estimation. Section [4] develops the problem of model selection and 
support estimation. In Sec. [5] the problem of correlation screening is discussed. In Sec. [6] 
sample complexity comparisons of various types of correlation inference tasks are presented. 
Concluding remarks are given in Sec. [7] 

cl Stochastic complexity is different from the notion of statistical complexity in statistical mechanics 1251 
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Correlation matrices and correlation networks 


2.1 Covariance and precision matrices 

Let X = [xi,... ,x p } T £ R p be a real-valued random (column) vector having a distribution 
whose second order moments exist. The mean vector pi = E[X] £ R p and the covariance 
matrix E = cov(X) = A[(X — /x)(X — pi) T ] £ R pXp are defined in terms of the statistical 
expectation operator E[-\ associated with the distribution of X. The matrix E is symmetric 
positive semidefinite and its if th entry is the covariance cov(Xj, Xj) = E[{Xi?~ — fif], 

with /ii the z-th element of pi. If will be assumed throughout that E is in fact positive definite, 
which is generally true when the p components of X are non-redundant random variables, 
i.e., no variable can be predicted without error using linear combinations of the other vari¬ 
ables. The inverse of the covariance matrix is called the precision matrix 12 = E^ 1 . The 
covariance and precision matrices capture marginal dependency and conditional dependency, 
respectively, in the joint distribution of X. In particular, when E (12) is sparse, i.e., it has 
only a few non-zero elements, there are many variables that are marginally (conditionally) 
independent of the other variables. More will be said about this in Sec. m 

The matrices E and 12 are not invariant to scaling of the component variables of X. Scale 
invariance is important when one is interested in a measure of dependency that is not a 
function of the units used to represent different variables. The correlation matrix R and 
partial correlation matrix P are respective versions of E and 12 that are scale invariant 

R = diag(E)- 1 / 2 Sdiag(E)- 1 / 2 (1) 

P = diag(12) -1 / 2 12diag(12) -1 / 2 (2) 

where, for a p x p matrix B, diag(B) is the p x p diagonal matrix formed from the diagonal 
elements of B and, for a diagonal matrix D with non-zero valued entries {da}? =1 along the 
diagonal, D~ 1//2 denotes the diagonal matrix with entries {1 /\/^}f=i along the diagonal. 
The diagonal elements of R and P are equal to 1 and the off-diagonal elements are between 
—1 and 1. An important property of R and P is that they retain any zero patterning in E 
and 12, respectively, if E or 12 are sparse. 

Let {Xfc}^ =1 be a set of n i.i.d. realizations from the distribution of X. The sample 
covariance matrix is the p x p positive semidefinite matrix 

1 n 

Sn =- T V(x fc - £)(X fc - fi) T (3) 

fc =0 

where pi = n~ l £L,x* is the sample mean vector. The sample covariance S n is a sta¬ 
tistically consistent estimator of E, i.e., it converges to E with probability one as n —> oo 
if p is fixed. Furthermore, the plug-in estimator R = diag(S n ) -1 / 2 S n diag(S n ) -1 / 2 can be 
used to estimate R. The matrix S n is positive definite with probability one if n > p and 
if the distribution of X is Lebesgue continuous. In this case, plug-in estimators 12 = S” 1 
and P = diag(12) -1 / 2 12diag(12) -1 / 2 can be used to consistently estimate 12 and P. In the 
case n < p, which is the main focus of this paper, alternative estimation strategies must be 
adopted. 
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2.2 The Gaussian graphical model 


Assume that the data is a set of n i.i.d. realizations {Xfc}]J =1 of a random vector X G R p 
that has a multivariate Gaussian distribution with positive definite covariance matrix X. 
When the precision matrix il is sparse such data is said to be a Gauss Markov random field 
(GMRF) or, equivalently, is said to be distributed according to a Gaussian graphical model 
(GGM). The log-likelihood function for X and Q can be expressed as 


m = 

-|tr{E 'S n } - ^logdet(X), 

(4) 

l{ O) = 

77 77 

--tr{ftS n } + — logdet(fi), 

(5) 


up to unimportant additive constants. In the GGM Q is a more natural parameterization 
than X in the sense that it is sparse while E may not be sparse. In addition, the Gaussian 
Graphical model corresponds to a natural exponential family with il as the canonical param¬ 
eter. Furthermore, the sparse structure of f2 is directly related to conditional independencies 
specified by the pairwise, local and global Markov properties of undirected graphical models 
and also by the factorization of the joint density of X S1!. j 1 10 . 

The GGM is a “graphical model” in the following sense. Define a graph Q whose p vertices 
(or nodes) correspond to the p variables X \,..., X p in X. Assign an edge between node i 
and node j if and only if the ij- th entry of the sparse precision matrix Q is non-zero. The 
edges of the so-constructed graph Q , often called a correlation or partial correlation graph, 
are specified by an adjacency matrix A. A is a binary matrix that indicates the support 
of the matrix Q; specifically the non-zero entries of A indicate the locations the non-zero 
entries of fl. The degree of node i is the number of neighboring nodes in the graph and 
corresponds to the number of non-zero entries in the i-th row of fl. In GMRFs it is typically 
assumed that the graph Q has only a small number (0(p)) of edges, corresponding to the 
property f2 is row spa rscp] [81J. 

Observe that in a GGM it is the inverse covariance f2 and not the covariance itself that 
is important. There are many other reasons for the wide interest in f2 and its scaled version 
P. the partial correlation matrix (J2]) » 0. EH, pi, urn 03, [69] . First, given predictor 
variables X = [Xy,.. ., X n ] T , the optimal linear predictor of a response variable Y depends 
on cov(X) only through its inverse. Second, when X is Gaussian, the Bayes-optimal test 
between two hypothesized covariances similarly depends only on the inverse covariances. 
Third, many classical multivariate procedures such as linear discriminant analysis (LDA) 
and multivariate analysis of variance (MANOVA) depend directly on the inverse covariance 
matrix. Fourth, the ij- th entry of P is equal to the conditional correlation between Xy and 
Xj given the rest of the variables {AFifth, if X t is the optimal linear predictor of Xy 
given the ij- th entry of the partial correlation matrix P is equal to the correlation 

coefficient between the prediction error residuals associated with X % and X r Finally, many 
physical random processes have a sparse inverse covariance matrix while the covariance 
matrix is not sparse. We illustrate this last point by the following example. 

cl A px p matrix is row sparse with sparsity coefficient s if no row has more than s non-zero entries, where 
s = o(p). 
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2.3 Illustrative example of a GGM 


We illustrate the differences between the covariance and inverse covariance matrices in a 
GGM using an example motivated by statistical physics. Let X(ui,u 2 ) be a function on the 
unit square [0, l] 2 C It 2 that satisfies the Poisson partial differential equation 

AX(ui,u 2 ) = W{ui,u 2 ), u±,u 2 G (0, l) 2 
X(u 1 ,u 2 ) = 0, ui, u 2 4 (0, l) 2 


where W (ui, u 2 ) is a fixed integrable function. Here A A" = 4^4 + ^4- is the Laplacian differ¬ 
ential operator. Solutions X to the Poisson equation arise in heat transfer, electromagnetics, 
and fluid dynamics m ■ 


We extract a Gaussian graphical model from the Poisson equation by discretizing it to 
a finite difference equation and setting the discretized W to be a Gaussian random matrix. 
This will convert the continuous domain function X over [0, l] 2 to a vector X with p = N\N 2 
elements, where N { = 1 /Si and N 2 = 1 jS 2 and hi, S 2 are the U\ and u 2 increments used in the 
finite difference approximation. Specifically, discretize U\ and u 2 to the sets U\ G 
and u 2 e {kS 2 }^ 0 . Then, assuming smooth W, the discretized X satisfies the finite difference 
equation: 


Xij = 


+ (Ajj +1 + Xij i)5i 


WijSrfz 


2 (h 2 + h|) 


( 6 ) 


Second, arrange the (Ainto the vector X e R A in lexicographic order. This results 
in an equivalent matrix form of the equation (|6]) 


[I - A]X = W 


where A is a sparse tri-diagonal matrix. 

Under the assumption that W is a i.i.d. zero mean spatio temporal Gaussian driving 
noise, i.e., cov(W) = cr^I, the covariance E and the inverse covariance fl of X have the 
form 

S = f 2 - 1 , n = [I-A][I-A Y/a^. 

As A is sparse tri-diagonal, the inverse covariance f2 and partial correlation P are both sparse 
penta-diagonal matrices with support set shown in Fig. [I] for Ni = N 2 = 5. Therefore, the 
random vector X is a Gauss Markov random field with GGM spatial dependency structure 
specified by the non-zero entries of f2, equivalently of P. Note that the covariance matrix X 
is not sparse as the inverse of the penta-diagonal matrix is a full matrix. Thus, as is usually 
the case for a GGM, the inverse covariance is a more parsimonious model description than 
is the covariance. 

For visualization, two realizations of X are shown in Fig. [2] for a simulation of the 
discretized Poisson equation (J6| on a 30 x 30 spatial grid (p = 900). While general methods 
of support set estimation are discussed in Sec. i Fig. 1 illustrates a simple empirical 
estimator of the support set of f2 based on n = 1500 samples of X. This simple estimator 
is constructed by thresholding an estimate P of the partial correlation matrix P at a level 




p G [0,1], which is user defined. The non-zero entries of the adjacency matrix associated with 
the thresholded P specifies the support set estimator. Here P = diag(17) “ 1 /2 17diag (17) _ ^, 
where £7 = S” 1 is the inverse of the sample covariance S n defined in (31). Figure p shows 
the support set estimator for six different values of the applied threshold p. Note that when 
p = 0.26114 the true support set, illustrated in Fig. [T| for a 5 x 5 grid, is recovered. When 
p decreases below a critical value there is an abrupt increase in the number of false edges. 
Below this critical value (top left panel of Fig. [3]) the number of false edges approaches the 
number (jj) of edges in the complete graph. 



Figure 1: The Gaussian graphical model obtained from a finite difference approximation to the 
Poisson equation has only local spatial dependency (left panel) resulting in a pentadiagonal inverse 
covariance matrix (right panel). 


3 Correlation mining: estimation 

An ambitious correlation mining objective is to empirically estimate each and every one of 
the entries of the covariance X or inverse covariance matrix 17 from the n samples. The 
accuracy of these estimators is often measured by the Frobenius norm of the estimation 
error. Other common criteria for gauging estimator accuracy are the matrix i 2 norm, also 
called the operator norm, and the matrix t\ norm. For example, in estimation of 17 the 
Frobenius norm difference between 17 and its empirical estimate 17 is: 

v 

y ^ ( hJij (hjj)", 

i,j= 1 

where 0 Jij denotes the ij -th entry of 17. When the objective is to estimate the entries of 
the covariance, the correlation or the partial correlation matrix, the Frobenius norm error is 
defined similarly. Covariance and inverse covariance estimation arise in many applications 
including: finance (86], [50j; gene expression modeling [157]. [ 115] : sensor array processing 


17 — fl\\ F 
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Figure 2: Two (n = 2) independent realizations of the discretized Poisson random field obtained 
from the finite difference approximation |6p with N\ = IV 2 = 90. Here the driving process W is a 
zero mean spatio-temporally white Gaussian noise. 

ram [66] : space-time-adaptive processing (STAP) radar [149} , [53] [ 126] ; spatio-temporal 
classification ra and prediction [130 ]: brain connectomics m,m-, and sensor network 
anomaly detection [l9j| . 

When the data are multivariate Gaussian, the maximum likelihood (ML) estimator of fl 
maximizes the log-likelihood function (J5]) . In the case when n > p the ML estimator of the 
covariance E is equal to E = where S n is the sample covariance matrix defined in (3), 

and the ML estimator of fl is Q = E _1 . In high dimensional situations where n < p , S n is not 
positive definite and these ML estimators do not exist. Estimator regularization is commonly 
used in this situation in order to yield a positive definite solution. Model-based regularization 
methods add a suitable penalty function c(f2) to the log-likelihood function resulting in the 
so-called penalized ML estimator. When the penalty is interpreted as a log prior on Q 
the penalized ML estimator is equivalent to the maximum a posteriori (MAP) estimator, 
also known as the posterior mode estimator. Commonly used regularization penalties are 
the quadratic penalty A||f2||^ and the non-sparsity penalty A||f2||i = A l^ij Ij where 
A > 0 is called the regularization parameter, and pushes the penalized ML solution towards 
a diagonal matrix as A increases to 00 . Other penalties shrink the penalized ML solution 
towards more highly structured matrices, e.g., block sparse, Toeplitz, banded, circulant, or 
Kronecker matrix structure. We call the latter structured penalties as contrasted to the 
un-structured penalties that encourage the estimator to be diagonal. When the true inverse 
covariance fl has the same structure as the structure induced by the penalty we say that the 
penalty of the penalized ML estimator is matched to the model. 

The effect of different structured models for Q on the asymptotic convergence rate of the 
penalized ML error can be of interest to practitioners. In large scale settings the high dimen¬ 
sional convergence rate is a natural comparative performance measure. High dimensional 
rates of convergence have been obtained under a large range of structured and unstructured 
penalties (and matched models). Often, when the penalty is matched to the model, the 
Frobenius norm estimation error decreases at asymptotic rate of the form ^/P(logp)/n as 
p, n —>■ 00 where P is the (effective) number of free parameters in the model. We illustrate 
for the case of spatio-temporal multivariate Gaussian graphical models; a poster child for 
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p=0.0791 


p=0.13978 


p=0.20046 



Figure 3: The support of the thresholded sample partial correlation matrix, rendered as a graph 
over spatial locations in the plane, for a Gaussian random field generated by the Poisson equation 
on a 30 X 30 grid (p = 900J. The sample partial correlation P = diag(fi) _1 / 2 f2diag(f2) _1 ' 2 , where 
f! = S” 1 the inverse of the sample covariance matrix, was computed using n = 1500 samples. The 
graph is shown for six different threshold levels p applied to P. The true cartesian support set is 
recovered for p = 0.26114. There is a sharp increase in the number of false edges as the threshold 
p decreases below a certain threshold, somewhere between 0.0791 and 0.13978. The location of this 
threshold appears well approximated by the theory m that predicts a critical threshold value 0.091 
(See equation 0 in Sec. [5| of this paper). 
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high dimensional inference. 


3.1 Illustration: estimation of a spatio-temporal precision matrix 

Consider the case where a set of q sensors acquires r time samples (a snapshot) of a spatio- 
temporal random held, e.g., the STAP radar example [53].[59] discussed in the introduction. 
There are n snapshots of this random held. Each snapshot generates a multivariate Gaus¬ 
sian distributed random matrix X with q rows (spatial) and r columns (temporal). Define 
the lexicographically ordered vectorization X = vec(X) e H qr of X. The spatio-temporal 
covariance cov(X) = X of a snapshot is an unknown p x p matrix where p = qr. The task is 
to estimate the precision matrix f2 = X -1 . When there is no additional information about 
X, beyond that it is symmetric and positive definite, the precision matrix f2 is completely 
unknown and there are P = (!() = 0(p 2 ) unknown parameters to be estimated, which is 
the number of distinct entries of the matrix. The base case of no information is often called 
the saturated model. If the experimenter is given additional contextual information about 
the structure of the precision matrix the value of this information can often be quantified in 
terms of sample complexity. We give several examples below. 

Spatio-temporal contextual information: sparse precision matrix: Contextual in¬ 
formation specifying that X is a GGM with sparse inverse covariance matrix results in a 
significant reduction in the number of free parameters in Q: from P = 0(p 2 ) to P = 0(p). 
The problem of sparse covariance estimation has been of significant interest h. a, eg. 
When the spatio-temporal matrix X represents the outputs of sensor network, an example 
of contextual information is the knowledge that the physical environment is a time-varying 
random held whose amplitudes obey the laws of Lagrangian classical mechanics, e.g., fluid 
how, heat how, or electro-magnetic wave helds that satisfy Poisson or Navier-Stokes partial 
differential equations [117] . The contextual information may simply be an upper bound on 
the sparsity parameter s or it may actually specify the sparsity pattern, i.e., hx the graph Q 
specifying the support of f2. 

Spatio-temporal contextual information: Kronecker covariance: Any spatio-temporal 
covariance has some degree of Kronecker structure that can be captured by the Kronecker 
sum decomposition of Pitsianis and Van Loan [ 135] . |13Uj . When applied to the inverse 
covariance, this decomposition is f2 = Ylk=i A* ® Bj where k is the Kronecker rank and 
A ? B; are linearly independent q x q and r x r matrices, called Kronecker factors. When 
the contextual information is that f2 has Kronecker rank k the number of free parameters 
in n is reduced from 0(p 2 ) to 0(k(q 2 + r 2 )). The simplest case occurs when fl is known to 
have Kronecker rank 1: fl = A(^)B, where A and B are symmetric positive definite q x q 
(spatial) and r x r (temporal) covariance matrices. Covariance estimation in this Kronecker 
model, also known as the matrix normal model, has been widely studied [34J, [HZ], [131]. A 
physical example of Kronecker-structured covariance is the case that X are outputs of a pas¬ 
sive antenna array that is sensing the EM emissions of k wide-sense-stationary non-moving 
targets in the far field of the array. In particular, if k — 1 there is a single target and the 
spatial and temporal components of the covariance are completely decoupled. In this case, 
the contextual information could be the number of targets. 
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Spatio-temporal contextual information: Kronecker+sparse precision matrix: 

Consider the case that the contextual information specifies that X has a precision matrix fl 
that is both Kronecker structured and has sparse Kronecker factors. When the Kronecker 
rank is k — 1 then f2 factors into A <g) B and there are P = 0(qsA + tsb) unknown param¬ 
eters, where sa and sb specify the sparsity in each of the Kronecker factors A and B |131j . 

EH 

The value of the different types of contextual information can be studied in terms of 
estimation of using theory developed recently in |131j . which discusses estimators and 
estimator convergence rates for the cases that f2 is, respectively, sparse, Kronecker, and 
sparse-Kronecker structured. These structural constraints are incorporated into the estima¬ 
tor Q of f2 using penalized maximum likelihood methods, which can be interpreted as MAP 
covariance estimators when the prior distributions on the entries of f2 are specified by the 
penalty functions. For each case denote the mean-squared error of the MAP estimator as the 
expectation of the Frobenius norm difference squared: MSE= _ZF[|| 17 — f2||p]. The relative 
decrease in the associated MSE due to any of the above pieces of contextual information is 
a function of the sample complexity for the inverse covariance estimation task. The forms of 
these MAP estimators, their asymptotic log MSE, and their asymptotic sample complexity 
can be obtained directly from [13 lj . |130j and are summarized below and in Table [2j 

For the saturated model (no side information about f2) the maximum likelihood estimator 
is equal to the inverse of the sample covariance E = n -1 Xl"=i( x * ~~ x )( x * — x) T , where x is 
the sample mean of {x,; }" =1 |37j. This is the MAP estimator under the uniform estimator- 
loss function and a uninformative (constant) prior on f2. When contextual information 
specifies that il is in fact sparse, the Glasso penalized ML estimator m adds an 11-norm 
penalty on Q, to the log-likelihood function. The Glasso precision estimator is the MAP 
estimator under a Laplacian-like prior on f2 and it can be determined using an iterative 
maximization algorithm. When contextual information is that f2 has Kronecker structure, 
again an iterative algorithm must be used to find the maximum likelihood estimator under 
the Kronecker model [ 146 j. This is the Bayes optimal estimator under a Kronecker covariance 
model and non-informative priors on the Kronecker factors. Finally, when the contextual 
information is that fl is both Kronecker and sparse the Kronecker log-likelihood model can 
be penalized by 11-norms on each of the Kronecker factors, resulting in a MAP estimator 
under a Laplacian-type prior on each of the factors EH, US]. 

The value-added brought to estimator performance by each of these contextual informa¬ 
tion sources can be assessed by studying the asymptotic sample complexity. The asymptotic 
sample complexity is defined by the number n = n p of samples, as a function of the di¬ 
mension p , required to maintain a given level of performance as p goes to infinity. Under 
general conditions on the MAP spatio-temporal covariance estimators, the log MSE takes 
the high dimensional (large q, r and large n ) asymptotic form shown in the 3rd row of Table 
|H2|. The second row of the table gives the form for the prior on f2 as (from left to right): 
1) uniform prior in the case of no contextual information; 2) sparse t\ prior in the case of 
sparsity information on f2; 3) Kronecker structured fl in the case of contextual information 
that decouples space and time (the prior (f(rankR(f2) — 1) is a delta function that forces 
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Information 

None 

Sparse 

Kronecker 

Kronecker+sparse 

Model 

saturated fl 

sparse fl 

fl = A® B 

sparse fl = A ® B 

log f(fl) 

constant 

AD i 

5 (rank77(f2) — 1) 

6 (rank77(fl) — 1) + Ai A i + A 2 B 2 

Bound 


+g(++) 

1 log ( (9 2+r2 ) lo g M j 

ii og (+y<«") 

Regime 

a 2 r 2 

1 — —y oi 

n 

qr log qr 
n 

(?*++) log M ^ 
n 

(g+r) log M 
n 


Table 2: Expressions for performance of MAP estimators (4th row) and the associated regimes 
of asymptotic sample complexity (5th row) for estimating the qr x qr spatio-temporal inverse cor¬ 
relation matrix fl = X! -1 = (cov(X)) -1 , associated with the q X r space-time Gaussian random 
matrix X, for different types of models of ft representing prior contextual information (2nd and 
3rd rows). The contextual information specifies patterns of sparsity and/or dependency between the 
q spatial coordinates and the r temporal coordinates. The bound is the asymptotic (large q, r and 
n) log Frobenius norm error of the Bayes-optimal MAP estimator of fl, for the different types of 
contextual information (1st row). The four rightmost columns of the table correspond respectively 
to: no contextual information about fl (None); information that fl is sparse, corresponding to X 
being a Gauss Markov random field (GMRF), with A > 0 sufficiently large so that the sparsity factor 
is of order o(qr); information that fl has (rank k = 1) Kronecker product structure (Kronecker), 
corresponding to decoupled rows and columns of X; and information that fl has (rank k = 1) 
Kronecker product structure with sparse Kronecker factors (Kronecker GMRF), with Ai,A 2 > 0 uf- 
ficicently large so that sparsity factors are of order o(max{q,r}). Comparing the sample complexity 
regimes (4th row) between the various types of contextual information quantifies the value of taking 
additional samples (See Fig. [^]j. 

kronecker structure fl = A ® D p] 4) Kronecker plus sparse fl. 

By quantifying the change in log MSE associated with different types of contextual in¬ 
formation (sparse, Kronecker, Kronecker+sparse) the value of taking additional samples can 
be determined across the contextual information regimes shown in Table [2} For the pur¬ 
pose of comparison we assume that q = r = ^/p, i.e., the spatial and temporal dimensions 
are identical. For the different categories of contextual information we fix the number of 
variables p and the log MSE. Figure [4] plots the level sets of constant log MSE over the 
number of samples and the number of variables. These curves indicate that the knowledge 
of both sparse and Kronecker structure is more valuable than knowledge of either sparse or 
Kroenecker structure alone. To illustrate, assume that X is a 100 x 10 matrix corresponding 
to 10 shapshots of 100 sensors. Then p = 1000 and, from the right panel of Fig. |4j, if the con¬ 
textual information specified that the inverse covariance has Kronecker and sparse structure 
the MAP estimator requires only n = 75 samples as compared to 4000 or 8000 samples if 
Kronecker or sparse structure is specified, and n — 1, 000, 000 samples if no information were 
available. The value of the information that the covariance is both sparse and Kronecker 
structure is that it decreases sampling requirements by more than 4 orders of magnitude 
relative to no contextual information! 

cl lZ is the permutation-rearrangement operator (131) ] that maps qr x qr matrix fl into the q 2 x r 2 matrix 
77 (fl), which has rank 1 if and only if fl = A <g> B for some q x q matrix A and r x r matrix B. 
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Figure 4: Sample complexity for estimating the qr x qr spatio-temporal correlation matrix for 
different types of prior contextual information shown in the 1st row of Table [1] for the case q = 
r = yjp, where yjp is a positive integer. The curves show asymptotic sample complexity (5th row 
in the Table), which are constant contours of the proxies (4th row of the Table) over the plane of 
the number p of variables and the number of samples n. The asymptotic proxies are equal over 
all the curves and along each curve. Curves to the left represent lower sample complexity and 
indicate the reduction in the required number of samples to attain a given level of log MSE for 
specified number of parameters. Here the contextual information represents knowledge that the 
inverse covariance has sparse structure alone (curve labeled “GMRF”), Kronecker structure alone 
(curve labeled “Kronecker”), vs Kronecker + GMRF structure. The curve labeled “No information” 
represents the case where there is no contextual side information about the inverse covariance. 
The curve labeled Kronecker GMRF dominates all of the others since information that the inverse 
covariance has both Kronecker and GMRF structure achieves maximal reduction in the number of 
free parameters and provides the highest value per sample. 

4 Correlation mining: model selection 


One of the primary goals of model selection in the correlation mining setting is to identify the 
support of the correlation or partial correlation matrix, i.e., identify the pairs of variables with 
non-zero correlations or partial correlations, from n measurements of the set ofp variables [8], 
[47j . [65], [56], [27], [95], [104] . [113] . [46], [87], [72], [101] . Model selection should be easier 
than estimating the values of all the correlations, discussed in Sec. [3} The original covariance 
selection problem [Tf] considers estimating inverse covariance matrices with zero entries 
under a multivariate Gaussian model for the observations. In recent years, the problem of 
estimating the support of sparse inverse covariance matrices has become a popular topic in 
the high dimensional statistics and machine learning literature. 


As discussed in Sec. 2.1, the problem of identifying the sparse patterning structure in 


the inverse covariance matrix Q is equivalent to identifying the graph Q associated with the 
non-zero entries of 0, and is thus also popularly known as graphical model selection. Various 
approaches have been proposed for identifying graphical models from high dimensional data. 
They can be categorized broadly into penalized likelihood methods and Bayesian methods. 
Popular Bayesian methods entail specifying priors on the space of sparse covariance or inverse 
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covariance matrices m, m, em], sis! and using Bayesian scoring rules to undertake model 
selection. Penalized likelihood methods in the context of (partial) correlation mining can be 
further categorized into a) Gaussian-based penalized likelihood methods i, B3, (B5|. 

and b) Pseudo-likelihood based or regression based methods [95J, llO'lj . ]113j . [T6j . [87], 

m imi . 

Recent work on penalized likelihood methods have focused on understanding both the 
computational complexity and sample complexity of model selection approaches. The for¬ 
mer entails computing the sparse inverse covariance estimate using G-penalized likelihood 
approaches and identifying the corresponding graph. This includes among others developing 
fast iterative algorithms for maximizing G-penalized likelihoods in order to obtain sparse 
inverse covariance estimates, quantifying the computational complexity of these algorithms, 
and deriving convergence rates of the iterative algorithms. In line with the main theme of 
this paper, the focus of this section will be to understand the sample complexity of model 
selection problems in the correlation mining context. 

Sample complexity of model selection approaches are generally stated in terms of sign 
consistency and estimation consistency (see ra, m- As in covariance estimation, the 
set-up is to let both the sample size n, and the dimension p — p n tend to infinity and 
establish large sample properties for a sequence of covariance parameters that is growing 
in dimension. We shall illustrate the sample complexity of model selection via a concrete 
example below. In particular, we present below a sample complexity result of a recently 
proposed graphical model selection method proposed in [72| called CONCORD. 

CONCORD seeks to maximize the following jointly convex objective function, called the 
pseudo-likelihood, as proposed in m 

v i P 

Qcon(O) . ^ ^ U log UJg T ~ ^ ^ H^iiYj T ^ ^ ^ij ~^j 11 2 "b ^ ^ ^ (7) 

i= 1 i =1 j^i 1 

where c Oij is the ij- th element of the p x p matrix O, and Y* denotes the i-th feature 
vector. Iterative optimization algorithms, along with aspects of computational complexity 
and algorithm convergence, are covered in m, EU, EM]. For sample complexity results, 
we shall follow very closely large sample results of the CONCORD graphical model selection 
approach as given in [72]. Both estimation consistency and oracle properties under suitable 
regularity conditions are stated below. The reader is referred to ra for further technical 
details. 

Let {D n } n >i denote the sequence of true underlying inverse covariance matrices and let 
the dimension p = p n vary with the sample size n. As in uni, assume the existence of 
accurate estimates of the diagonal entries <i<p n such that for any p > 0, there exists 

a constant C > 0 such that 


max \a nii — ua\ < C 

1 <i<Pu 

holds with probability larger than 1 — 0(n~ r] ). 
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For vectors u° G “ and u d G M+ n , let the notation C n (u)°, u ) d ) denote ^ £an evaluated 

at a matrix with off-diagonal entries uj° and diagonal entries co d . Let co° = (( iUn,ij))i<i<j< Pn 
denote the vector of off-diagonal entries of f2 n , and 6t Pn G M+ n denotes the vector with entries 
{■ a n ,a}i<i<p n ■ Let the sequence A n denote the set of non-zero entries in the vector uj°, and 
let q n = |AJ- Let 9 n ^ = for 1 < i < j < p n and define 9° = ((9 n>ij ))i<i<j< Pn e 

]^Pn(pn-i)/ 2 _ Also, let s n = min(jj) G _ 4 n u ni ij. Assume furthermore that the following three 
regularity conditions are met (i) the spectrum of Ct n is uniformly bounded from above and 
below, (ii) sub-Gaussianity of the data, (iii) the incoherence condition [95]. 

Under the above assumptions the following model selection result is established in [72]. 

Theorem 1 [7^ Suppose that assumptions (i), (ii), (iii) are satisfied. Suppose p n = 0{n K ) 
for some k > 0, q n = o {^/n/\ognj, ^ = 0 (\ n ), X n ^n/logn ->• oo, ->• oo 

and yfqf\ n —» 0, as n —>■ oo. Then there exists a constant C such that for any q > 0, the 
following events hold with probability at least 1 — 0{n~ v ). 

• There exists a minimizer = ((w n ,ij))i<i<j< Pn of the CONCORD objective function 
Q covfN j ■ 

• Any minimizer uj° of Q con (iv°, a n ) satisfies ||o;° — o>° 11 2 < C^fqfXn and sign(a5 nj j_j) = 
sign^y), V 1 < i < j < p n . 

The above result in [72] establishes model selection consistency (or rather sign consistency 
to be precise), and is in spirit similar to other model selection consistency results in the 
literature (see also [[95] and m)- A few remarks are in order with regards to sample 
complexity. First, note that model selection consistency requires that both the sample n and 
dimension p n tend to infinity. Hence asymptotic guarantees require large sample sizes and 
thus model selection consistency may not be valid in sample starved settings. Second, results 
in model selection consistency are often proved under the assumption of sub-Gaussianity of 
the tails, and thus may be restrictive in many applications with heavy-tailed data. Third, 
note that the dimension p n can grow faster than the sample size n, but cannot grow faster 
than a polynomial rate. 


5 Correlation mining: screening 

In correlation screening one seeks to discover patterns of high correlation or partial corre¬ 
lation between p variables based on a set of n observations [BTj, [5S]- Stated in terms of a 
Gaussian graphical model, the objective is to infer topological characteristics of the graph 
Q associated with the zeros in the precision matrix fl. In JBTj, we treated the problem of 
screening for the presence of variables with high correlations to other variables. In [58] . 
we considered the setting of screening for the presence of connected nodes and hubs in Q 
with high partial correlation. Screening for such topology characteristics of Q should be 
easier than model selection or covariance estimation. Similarly to what was demonstrated 
for covariance estimation (recall Table [ 2 ]), contextual information can be of high value for 
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screening. For example, one may be given information that specifies a certain sparsity level 
or block diagonal structure of the inverse covariance, or information on the minimum level 
of correlation that exists among the active variables in a block. 

The correlation screening method of [58] finds edges, hubs, and other subgraph structures 
of Q by performing hypothesis testing. The method applies a threshold to an empirical 
estimate P of the partial correlation matrix P. defined in (jij), placing an edge in Q where 
the magnitude of the entry of P exceeds the threshold. When n > p, P may be the simple 
plug-in estimator of the matrix inverse of the sample correlation estimator, while if n < p the 
correlation screening methods developed in [58] uses the Moore-Penrose generalized inverse 
of the sample correlation estimator. Correlation screening has been studied and applied to 
hub discovery [SB], edge discovery [53] and classification of local node degree [JTj in a variety 
of graphical model applications including: stationary Gaussian spatio-temporal processes 
models [39], [40]; sparse regression models [37]; and multiple model testing for common 
sparsity patterns [61]. 

The computational complexity of correlation screening is much lower than model selection 
or covariance estimation, only of the order of 0(n 2 ), in the sample starved case of n C p. 
To illustrate, the sample complexity of screening edges in Q can be determined from the 
following theorem (adapted from [58], Prop. 2], see also [63]): 

Theorem 2 Assume that the n samples {Xfc}^ =1 are i.i.d. random vectors in ML with 
bounded elliptically contoured density and block sparse p x p covariance matrix. Let p be the 
threshold applied to the sample partial correlation matrix P. Assume that p goes to oo and 
p = p p goes to one at a rate specified by the relation linip^oopQo — 1)(1 — p 2 )G- 2 )/ 2 — en; 
where e n G (0, oo) is given. Then the probability P e that there exists at least one false edge 
in Q satisfies 

lim P e = 1 - exp(— k„,/2), (8) 

p—>CO 

where 

K n 2 ) 

where a n is the volume of the n — 2 dimensional unit sphere in It"' -1 and is given by a n = 
r((n-l)/ 2 ) 

V¥r((n-2)/2) ’ 

As in other types of hypothesis testing problems, two types of correlation screening errors 
can occur: false positives (Type I) and false negatives (Type II). It is common for an exper¬ 
imenter to constrain the false positive rate to ensure a certain level of Type I error control. 
Remarkably, Thm. [2] asserts that, under the stated conditions, the large p false positive rate 
does not depend on the true covariance E. In this large p case the correlation threshold p can 
be set to attain a given level of false positive control. This fortunate situation is analogous 
to constant false alarm rate (CFAR) signal detection in radar processing [120] , 1 121 .j 110 :. In 
[61 J it was shown that, when screening a large number p of variables, the false positive rate 
undergoes a fundamental phase transition as a function of the applied threshold: the rate 
precipitously increases from almost to zero to one as the correlation threshold is decreased 
beyond a critical threshold p c . A direct consequence of Thm. [2] is that p c has the form [581 
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Eq. (10)]: 


Pc = /l-KIp-lll-VM, (9) 

When the applied threshold is greater than p c there will be few false positives while when it 
is below p c the system will be inundated by false positives. 

Using and a large p approximation [58j Eq. (22)] to the false positive rate following 
directly from Tlnn. [2j we can quantify the intrinsic value of taking an additional sample 
when the task is to detect variables that have true correlations exceeding a given threshold 
p. Assume that p is fixed but large. Figures [5] and [6] gives a family of design curves that can 
be used by the system designer to right-size n for given p and given desired correlation level 
P- 




Figure 5: Correlation screening curves quantifying value of information associated with acquiring 
more samples n for different parameter dimensions (p = 100,10, 000, and 10,000, 000,000 ) in terms 
of the minimum detectable correlation value p. The screening task is to detect variables that have 
high correlation (greater than p) to at least one other variable. These curves specify the minimum 
required number of samples (n) for reliable detection of such variables at given family wise false 
positive error rates. For example, for ten billion (10 10 ) variables at least 200 samples are required to 
reliably detect a variable having correlation greater than p = 0.6, while fewer than half the number 
of samples would be needed to detect the same level of correlation if there were only ten thousand 
variables. Thus, for the correlation screening task, the value of a sample is much higher when 
there are fewer variables, and the displayed curves quantify this value. The curves in left panel 
are isoclines on the probability of error surface for fixed family wise error rate (FWER) equal to 
0.0001. The curves in the right panel are similar except that they are isoclinal for fixed mean false 
positive rate of 1 (only 1 false positive node detected out of p nodes). The respective FWER’s of 
false positive probability are designated on each curve in the right panel. The curves in the left 
and right panel are very similar since the probability of error surface undergoes an abrupt phase 
transition from 0 to 1. 
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Figure 6: Correlation screening curves quantifying value of information associated with acquiring 
more samples n for different minimal detectable correlation levels (p = 0.3,0.4, 0.5, 0.6,0.7,0.8j in 
terms of the parameter dimension p. The screening task is the same as in Fig. [5| but the phase 
transition is plotted differently to reveal the value of information for detecting different fixed levels 
of correlation for varying numbers of parameters p. Note that the number of samples n required 
for reliably detecting variables with high correlations, e.g., p = 0.8, increases much more slowly as 
p increases than it does for small correlations, e.g., p = 0.3. Thus, as the desired correlation level 
increases, there is a diminishing return in the value of information delivered by acquiring additional 
samples. 


6 Correlation mining: intrinsic sample complexity regimes 


The experimenter is often faced with several correlation mining tasks, possibly performed in 
sequence. For example, detection of existence of high correlations among p variables may be 
followed by identification of the set of highly correlated variables, followed by estimation of 
the values of their correlations, followed by specification of the uncertainty (confidence inter¬ 
vals) associated with these estimates. Reliable accomplishment of each task becomes more 
difficult as one progresses from detection to uncertainty quantification, requiring progres¬ 
sively larger numbers of samples and progressively smaller critical phase transition thresh¬ 
olds p c . Establishing the sampling regimes associated with each one of these tasks is one of 
the fundamental problems of large scale inference and data science. 

Recall that the asymptotic sample complexity associated with an inference task is the 
number n = n p of samples, as a function of the dimension p, required to maintain a given 
value of risk as p goes to infinity. Table [3] summarizes the sample complexity regimes (3rd 
row) and critical phase transition threshold regimes (4th row) for tasks relevant to correlation 
mining. These are discussed below in order of increasing sample complexity. 

• Screening: The screening task is to use the sample correlation to detect the existence 
of high correlations, by which we mean large values in either the population correlation 
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or partial correlation matrix; equivalently to detect the existence of an edge in the 
correlation or partial correlation network, as discussed in Sec. [5] As such, it is a binary 
hypothesis testing problem having risk function equal to the false positive probability 
under the null hypothesis Hq of a sparse and invertible population covariance matrix 
X. For the threshold-based correlation screening method of [6TJ the false positive 
probability is P(N e > 0), where N e is the number of entries (edges) of the sample 
partial correlation matrix that exceed a threshold p. The bound is the asymptotic 
limiting value of the false positive edge probability specified in Thm. [2] The sample 
complexity regime is: n=£xed (not a function of p ) while p —> oo, denoted in Table [3] 


as 


logP 


—> oo 


. The critical threshold § converges to 1 as p —» oo. 


• Detection: The detection task is the same as the screening task except that both high 
and low correlations are of interest. The experimenter specifies a threshold p £ (0,1) 
and the objective is to find correlations of magnitude at least p. The sample complexity 
regime for this problem is: n increases to infinity with p at the asymptotic rate = cc, 
where a £ (0, oo) is a constant. The critical threshold ([9]) converges to a constant p* 
satisfying 0 < p* < p as p —>■ oo. 


• Support recovery: This is the problem of model selection discussed in Sec. [4] where 
the objective is to identify the support set S C p} of the population inverse 

covariance f2. If S denotes an estimator of this support set, the risk function is the 
probability that indices in S are missing in S or that indices in {1,... ,p} C S are 
erroneously included in S, denoted P(card{5A5} = 0) where AAB is the symmetric 
difference between sets A and B. Assume it is a priori known that the cardinality of 
S is at most fc, where 1 < k < p. A finite sample upper bound follows by applying 
the union bound over the possible subsets of {1,... ,p} of cardinality at most k to 
obtain: P(card{5A5} = </>)< Xu=o {T ) e where /3 is the minimum Kullback- 
Liebler divergence between these subsets. The well known bound [3] on partial sums of 
binomial coefficients Yli=o (i) — 2 H ^ k ^ p , where H(e) = —eloge— (1 — e) log(l —e), can 
then be used along with the representation H{k/p)p = p v for some v £ (0,1] to obtain 
the risk bound: 2 pV e~^ n . Therefore, the limiting regime of values (n,p) for which this 
bound is constant gives the sample complexity: n increases to infinity with p at the 
asymptotic rate — = a, where v £ (0,1) and a £ (0, oo). Note that this is consistent 
with the rate p = 0(n K ) of the CONCORD support recovery algorithm, given in Thm. 
[lj for k, > 1, the regime of where the number of samples is sufficient for model selection 
but not for parameter estimation. In this regime, the critical threshold Q converges 
to zero as p —> oo. Note that the asymptotic rates reported in Table [3] for support 
recovery (or equivalently model selection) appear to be slower than what has been 
derived in the statistics literature. In particular, results are available which assert that 
provided (log p)/n —> 0, support recovery is possible with probability tending to 1 (see 
[114] for more details). We note that the these faster rates are a direct consequence of 
assuming that the variables are either Gaussian or sub-Gaussian, or by imposing some 
tail condition m- When such conditions are relaxed, the asymptotic rates coincide 
with the regimes given in Table [3] (see also mm for details on convergence in matrix 
norms when Gaussianity is relaxed). 
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• Parameter estimation: The problem of parameter estimation is to determine the 
individual values of the ( p ) entries in f2. The risk function is the MSE, defined as 
the mean Frobenius norm squared error i?[||f2 — between the population inverse 
covariance and an empirical estimator. The bound is the high dimensional limiting 
value of this MSE as n —> oo and p —> oo na. The sample complexity for this problem 
is: n increases to inhnity with p at the asymptotic rate pl ° gp = a, where a G (0, oo) is 
a constant. Again the critical threshold (J9]) converges to zero as p —» oo. 

• Performance estimation: We consider the most general (and stringent) setting for 
performance estimation where, for a specified Borel set B C R p , the probability P(X G 
B) must be accurately estimated. For example, assuming that X has a zero mean 
clliptically contoured density /(x), if B is the set {x G R p : ||x|| > 7 } for 7 > 0, 
P(X G B) is the critical region for optimally rejecting outliers and detecting anomalies 
m,m and the value 7 = ||X|| can be used to estimate the p-value associated with 
the null hypothesis that X is not an outlier. The sample complexity of estimation 
of P(X G B), for all £>, is equivalent to that of estimation of the density function 
/(x). Therefore, we adopt the mean integrated squared error (MISE) [ 132] as the risk 
function: f E[(fn(x) — /(x)) 2 ]dx where / is an empirical density estimator. It is known 
that if / is in the class of Lipschitz functions the minimax MISE risk is of the form 
[321: /dn~ 2/(1+p \ j3 > 0. Hence we use this minimax risk as a proxy for performance 
and the sample complexity for this problem is: n increases to inhnity with p at the 
asymptotic rate = a, where a G (0, cx)) is a constant. The critical threshold (|9j) 
again converges to zero as p —> 00 . 

We conclude this section with a comparison of computational complexity. Unlike pre¬ 
diction and model selection, correlation screening methods are scalable to very high dimen¬ 
sions both in terms of computation and memory scalability. Popular sparse optimization 
approaches to covariance estimation and covariance selection are iterative and include pe¬ 
nalized likelihood methods such as Glasso [43], SPACE [TUBj and CONCORD [71JJ . The 
computational complexity of Glasso after t iterations is of order 0{tp 3 ). This can be reduced 
when using regression based methods such as SPACE and CONCORD which have a com¬ 
putational complexity of order min {O(tnp 2 ), O(tp 3 )}. In contrast, correlation and partial 
correlation screening are non-iterative algorithms and the computational complexity is only 
of order 0(np log p) which can be considerably less than its penalized likelihood counter¬ 
parts. The lower order 0{np log p) is due to the fact that building the thresholded sample 
covariance is equivalent to constructing a Euclidean ball graph over p nodes in n — 1 dimen¬ 
sional space, for which reliable approximate nearest neighbors (ANN) algorithms [5] can be 
applied. Very fast and scalable C, Python and Matlab implementations of ANN algorithms 
are available, e.g., FLANN, Gensim, and annoy, which have been implemented on datasets 
with p in the millions and n in the hundreds. 
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Tabic 3: Different sample complexity regimes characterize the difficulty of performing different 
inference tasks on the inverse covariance Cl. A similar table would hold for inference on the co- 
variance S. Task (1st row) specified risk function (2nd row) for which we can use the upper bound 
(3rd row). The sample complexity regimes (4th row) for which the bound remains constant depends 
on the task: increasingly large sample sizes are required as the complexity of the task increases from 
left to right. The limiting value of the critical threshold p c , defined in &■ for screening (5th row) is 
shown for each regime. For the screening task (detection of existence of large partial correlations) 
the bound is the purely high dimensional (large p and fixed n) asymptotic limit of the probability of 
false positives associated with the test that at least one variable is highly correlated (given by The¬ 
orem 1). For the detection task (detection of existence of partial correlation of magnitude greater 
than p £ (0,1 )) the bound is the mixed high dimensional (large p and large n) asymptotic limit of the 
probability of false positives. In this regime, the critical threshold converges to p* where 0 < p* < p. 
For the support recovery task (model selection) the bound is the mixed high dimensional asymptotic 
probability of mis classification of the set of partial correlations using CONCORD. For this task 
the parameter v is specified by a priori knowledge on sparsity of the support set: u £ (0,1] and 
as the sparsity increases v —> 0. For the estimation task the bound is the mixed high dimensional 
asymptotic squared Frobenius norm error on the MAP estimator of sparse inverse covariance. Fi¬ 
nally, for the performance estimation task, the bound is the mixed high dimensional asymptotic 
minmax bound on estimation MSE of the probability of any (Borel) uncertainty set. The values of 
the constants a and /3 are not necessarily the same over different columns of the table. 
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7 Conclusions 


Big data is not just lots of data. This monolithic characterization is overly simplistic and 
ignores the issues of inference, limited samples, and reproducibility. Big data is of limited 
utility without appropriate inferential tools, e.g., use of the dataset to produce empirical 
estimates, classifications, or decisions on the population that generated the data. Inferences 
in turn lack credibility without accounting for errors due to limited samples. Without credible 
inferences there is no reproducibility: another random sample from the sample population 
may produce completely different results. 

This paper adopted a statistical perspective in which a large scale dataset is a set of n 
random samples drawn from a population of p variables and p is large. We focused on the 
problem of correlation mining where the objective is to infer properties of the population 
covariance matrix from the samples. The reliability of the inferences from limited samples 
can be mathematically characterized by the high dimensional learning rates and sample 
complexity associated with the inference problem. These specify the relative rate at which 
n must go to infinity as a function of p in order to ensure accurate performance. The sample 
complexity falls into different high dimensional regimes including the classical regime, where 
p is fixed and n goes to infinity, the mixed dimensional where both n and p goes to infinity, 
and the purely high dimensional where n is fixed and p goes to infinity. 

The comparative sampling complexity analysis illustrated in this paper unifies the prob¬ 
lem of sample sizing for large scale inference problems and, in particular, for correlation 
mining. Indeed, different sample complexity regimes each occupy a niche for different corre¬ 
lation mining tasks. In particular, screening for high correlations is governed by purely high 
dimensional rates while model selection, covariance estimation, and uncertainty quantifica¬ 
tion require n to go to infinity at progressively larger rates as a function of p. This implies, 
for example, that one can do screening with many fewer samples than are required for the 
other tasks. Furthermore, in situations where samples are acquired sequentially our analysis 
suggests that that one can adapt the inference task over time: starting with correlation 
screening when samples are few, and progressing on through support detection, covariance 
estimation, and uncertainty quantification as more and more samples are acquired. Such 
a strategy is explored in the context of the sequential prediction and regression criterion 
(SPARC) [37]. 
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List of Figures 

1 The Gaussian graphical model obtained from a finite difference approximation to 

the Poisson equation has only local spatial dependency (left panel) resulting in a 
pentadiagonal inverse covariance matrix (right panel) . |U] 

2 Two (n = 2) independent realizations of the discretized Poisson random field ob¬ 

tained from the finite difference approximation 0 with N\ = N 2 = 90. Here the 
driving process W is a zero mean spatio-temporally white Gaussian noise . m 

3 The support of the thresholded sample partial correlation matrix, rendered as a graph 
over spatial locations in the plane, for a Gaussian random field generated by the 
Poisson equation on a 30 x 30 grid (p = 900J. The sample partial correlation P = 
diag(0) _1 / 2 0diag(0) _1 / 2 , where fi = S ” 1 the inverse of the sample covariance 
matrix, was computed using n = 1500 samples. The graph is shown for six different 
threshold levels p applied to P. The true cartesian support set is recovered for 
p = 0.26114. There is a sharp increase in the number of false edges as the threshold 
p decreases below a certain threshold, somewhere between 0.0791 and 0.13978. The 
location of this threshold appears well approximated by the theory m that predicts 

a critical threshold value 0.091 (See equation 0 in Sec. [5| of this paper) . [TT] 

4 Sample complexity for estimating the qr x qr spatio-temporal correlation matrix for 
different types of prior contextual information shown in the 1st row of Tabled for 
the case q = r = v / p, where y/p is a positive integer. The curves show asymp¬ 
totic sample complexity (5th row in the Table), which are constant contours of the 
proxies ( 4 th row of the Table) over the plane of the number p of variables and 
the number of samples n. The asymptotic proxies are equal over all the curves 
and along each curve. Curves to the left represent lower sample complexity and 
indicate the reduction in the required number of samples to attain a given level 
of log MSE for specified number of parameters. Here the contextual information 
represents knowledge that the inverse covariance has sparse structure alone (curve 
labeled “GMRF”), Kronecker structure alone (curve labeled “Kronecker”), vs Kro- 
necker+GMRF structure. The curve labeled “No information” represents the case 
where there is no contextual side information about the inverse covariance. The 
curve labeled Kronecker GMRF dominates all of the others since information that 
the inverse covariance has both Kronecker and GMRF structure achieves maximal 
reduction in the number of free parameters and provides the highest value per sample. 1151 
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5 Correlation screening curves quantifying value of information associated with ac¬ 
quiring more samples n for different parameter dimensions (p = 100,10, 000, and 
10,000,000,000,) in terms of the minimum detectable correlation value p. The 
screening task is to detect variables that have high correlation (greater than p) to 
at least one other variable. These curves specify the minimum required number of 
samples (n) for reliable detection of such variables at given family wise false positive 
error rates. For example, for ten billion (1Q W ) variables at least 200 samples are 
required to reliably detect a variable having correlation greater than p = 0.6, while 
fewer than half the number of samples would be needed to detect the same level 
of correlation if there were only ten thousand variables. Thus, for the correlation 
screening task, the value of a sample is much higher when there are fewer variables, 
and the displayed curves quantify this value. The curves in left panel are isoclines 
on the probability of error surface for fixed family wise error rate (FWER) equal 
to 0.0001. The curves in the right panel are similar except that they are isoclinal 
for fixed mean false positive rate of 1 (only 1 false positive node detected out of p 
nodes). The respective FWER’s of false positive probability are designated on each 
curve in the right panel. The curves in the left and right panel are very similar 
since the probability of error surface undergoes an abrupt phase transition from 0 

to 1. da 

6 Correlation screening curves quantifying value of information associated with ac¬ 
quiring more samples n for different minimal detectable correlation levels (p = 

0.3,0.4,0.5, 0.6, 0.7, 0.8) in terms of the parameter dimension p. The screening task 
is the same as in Fig. [5| but the phase transition is plotted differently to reveal the 
value of information for detecting different fixed levels of correlation for varying 
numbers of parameters p. Note that the number of samples n required for reliably 
detecting variables with high correlations, e.g., p = 0.8, increases much more slowly 
as p increases than it does for small correlations, e.g., p = 0.3. Thus, as the desired 
correlation level increases, there is a diminishing return in the value of information 
delivered by acquiring additional samples . Ea 
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